17

Genomics

aligning multiple sequences, degrees of kinship can be assigned on the basis of the

score, which has the form

total score = score for aligned pairs + score for gaps.    (17.1)

The score is, in effect, the relative likelihood that a pair of sequences is related.

It represents distance, together with the operations (mutations and introduction of

gaps) required to edit one sequence into the other. Sequence alignment attempts to

maximize the number of matches while minimizing the number of mutations and gaps

required in the editing process. Unfortunately, the relative weights of the terms on the

right-hand side of (17.1) are arbitrary. The main approach to assigning weights to the

terms more objectively is to study many extant sequences from organisms one knows

from independent evidence to be related. In principle, under a given set of conditions

(e.g., a certain level of exposure to cosmic rays), a given mutation presumably has a

definite probability of occurrence; that is, this probability could be derived from

an objective set of data according to the frequentist interpretation, but the practical

difficulties and the possibility that such probabilities may be specific to the sequence

neighbouring the mutation make this an unpromising approach.
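As a minimal sketch of how a score of the form (17.1) is evaluated for a given alignment, the following Python function sums a term for the aligned pairs and a term for the gaps. The weights `MATCH`, `MISMATCH` and `GAP`, the function name `alignment_score` and the use of `-` as the gap symbol are assumptions made for illustration only; as noted above, the relative weights are arbitrary.

```python
# Hypothetical weights, chosen only for illustration.
MATCH = 1      # score for an identical aligned pair
MISMATCH = -1  # score for a substituted aligned pair
GAP = -2       # score for each gap position


def alignment_score(row_a: str, row_b: str) -> int:
    """Score two rows of an alignment; '-' marks a gap position."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    pair_score = 0  # first term on the right-hand side of (17.1)
    gap_score = 0   # second term on the right-hand side of (17.1)
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot consist of gaps only")
        if a == "-" or b == "-":
            gap_score += GAP
        elif a == b:
            pair_score += MATCH
        else:
            pair_score += MISMATCH
    return pair_score + gap_score


print(alignment_score("GATTACA", "GA-TGCA"))  # 5 matches, 1 mismatch, 1 gap -> 2
```

Changing the three weights changes which of several candidate alignments scores highest, which is why the choice of weights matters.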

With DNA sequences, a nucleotide is, at least to a first approximation,

either matched or not, whereas with polypeptides a substitution might be sufficiently close

chemically to be functionally neutral. Hence, if alignments are carried out at

the level of amino acids, exact matches and substitutions are dealt with by compiling

an empirical table, based on chemical or biological knowledge or both, of degrees of

equivalence.¹⁷ There is no uniquely optimal table. To construct one, a good starting

point is the table of amino acids (Table 15.6). Isoleucine should have about the same

score for substitution by leucine as for an exact match and so forth; substitution of

a polar for an apolar group or lysine for glutamic acid (say) would be given low or

negative scores. The biological approach is to look at the frequencies of the different

substitutions in pairs of proteins that can be considered to be functionally equivalent

from independent evidence (e.g., two enzymes that catalyse the same reaction).

In essence, the entries in a scoring matrix are numbers related to the probability of

a residue occurring in an alignment. Typically, they are calculated as (the logarithm

of) the probability of the “meaningful” occurrence of a pair of residues divided by

the probability of random occurrence. Probabilities of “meaningful” occurrences are

derived from actual alignments “known to be valid”. The inherent circularity of this

procedure gives it a temporary and provisional air.
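The log-odds calculation described above can be sketched as follows. The function name `log_odds_scores`, the base-2 logarithm, the factor of 2 in the random expectation for unlike residues (a common convention when pairs are treated as unordered) and the toy input are all assumptions made for illustration; real matrices are compiled from large sets of trusted alignments.

```python
import math
from collections import Counter


def log_odds_scores(aligned_pairs):
    """Return {(a, b): log2(observed / expected)} for unordered residue pairs."""
    pair_counts = Counter()
    residue_counts = Counter()
    for a, b in aligned_pairs:
        pair_counts[tuple(sorted((a, b)))] += 1  # treat (a, b) and (b, a) alike
        residue_counts[a] += 1
        residue_counts[b] += 1
    n_pairs = sum(pair_counts.values())
    n_residues = sum(residue_counts.values())
    scores = {}
    for (a, b), count in pair_counts.items():
        observed = count / n_pairs  # "meaningful" pair probability
        q_a = residue_counts[a] / n_residues
        q_b = residue_counts[b] / n_residues
        # Probability of the pair arising by chance from background frequencies.
        expected = q_a * q_b if a == b else 2 * q_a * q_b
        scores[(a, b)] = math.log2(observed / expected)
    return scores


# Toy data (hypothetical): columns taken from alignments assumed to be valid.
toy_pairs = [("L", "L")] * 3 + [("K", "K")] * 3 + [("L", "K")] * 2
scores = log_odds_scores(toy_pairs)
print(scores[("K", "L")])  # pair seen less often than chance -> negative score
print(scores[("L", "L")])  # pair seen more often than chance -> positive score
```

A positive entry means the pair occurs more often in the trusted alignments than chance would predict, and a negative entry means it occurs less often.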

In the case of gaps, the (negative) score might be a single value per gap or could

have two parameters: one for starting a gap, and another, multiplied by the gap length,

for continuing it (called an affine gap cost). This takes some slight account of possible

correlations in the history of changes presumed to have been responsible for causing

the divergence in sequences. The scoring of substitutions considers each mutation to

be an independent event, however.
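The two-parameter gap score described above can be sketched as follows, taking the text literally: one value for starting a gap plus another value multiplied by the gap length. The parameter values `GAP_OPEN` and `GAP_EXTEND`, the function names and the `-` gap symbol are assumptions made for illustration.

```python
import re

# Hypothetical parameter values, chosen only for illustration.
GAP_OPEN = -10   # charged once for starting a gap
GAP_EXTEND = -1  # charged per position of gap length


def affine_gap_score(length: int) -> int:
    """Score of a single gap of the given length under an affine gap cost."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0
    return GAP_OPEN + GAP_EXTEND * length


def total_gap_score(row: str) -> int:
    """Sum the affine scores of every run of '-' in one row of an alignment."""
    return sum(affine_gap_score(len(run)) for run in re.findall(r"-+", row))


print(affine_gap_score(3))           # one gap of length 3 -> -13
print(total_gap_score("AC---GT-A"))  # gaps of length 3 and 1 -> -24
```

With these values, one gap of length 4 scores -14 whereas four separate gaps of length 1 score -44, so contiguous gaps are favoured over scattered ones.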

17 For example, BLOSUM50, a 20 × 20 score matrix (histidine scores 10 if replacing histidine,

glutamine 0, alanine −3, and so on). The diagonal terms are not equal.